Abstract 

We propose a new detection algorithm that uses structural
relationships between senders and recipients of email as
the basis for identifying spam messages. Senders and
recipients are represented as vectors in their reciprocal
spaces. A measure of similarity between vectors is
constructed and used to group users into clusters. Knowledge
of their classification as past senders/receivers of spam or
legitimate mail, coming from an auxiliary detection
algorithm, is then used to label these clusters
probabilistically. The similarity between the sender and
recipient sets of a new message and the center vectors of
the clusters is then used to assess the probability of that
message being legitimate or spam. We show that the proposed
algorithm is able to correct part of the false positives
(legitimate messages classified as spam) using a testbed
consisting of a one-week SMTP log.



1 Introduction 

The relentless rise in spam email traffic, now accounting 
for about 83% of all incoming messages, up from 24% in 
January 2003 [13], is becoming one of the greatest threats 
to the use of email as a form of communication. 

The greatest problem in detecting spam stems from
active adversarial efforts to thwart classification. Spam
senders use a multitude of techniques, based on knowledge
of current detection algorithms, to evade detection.
These techniques range from changes in the way text is
written (so that it cannot be directly analyzed
computationally, but can be understood naturally by humans) to
frequent changes in other elements, such as user names,
domains, subjects, etc. Therefore, good choices for spam
identifiers are becoming increasingly difficult to find.

In light of this enormous variability, the question
then is: what are the identifiers of spam that are most
costly to change, from the point of view of the sender?
The limitations of attempts to recognize spam by analyzing
content are clear [6]. Content-based techniques [16,
21, 17] have to cope with the constant changes in the way
spammers generate their solicitations. The structure of
the target space for these solicitations tends, however, to be
much more stable, since spam senders still need to reach
recipients, even if under forged identifiers, in order to be
effective. Specifically, by structure we mean the space of
recipients targeted by a spam sender, as well as the space
of senders that target a given recipient, i.e., the contacts
of a user. The contact lists, or subsets thereof, can then
be thought of as a signature of spam senders and
recipients. Additionally, by constructing a similarity measure in
these spaces we can track how lists evolve over time, by
addition or removal of addresses.

In this paper, we propose an algorithm for spam
detection that uses structural relationships between senders
and recipients as the basis for the identification of spam
messages. The algorithm must work in conjunction with
another spam classifier (hereafter called the auxiliary
algorithm), which is needed to produce spam or legitimate
mail tags on past senders and receivers; these tags are in
turn used to infer new ones through structural similarity.
The key idea is that the lists to which spammers and
legitimate users send messages, as well as the lists from
which they receive messages, can be used as identifiers of
classes of email traffic [19, 10]. We will show that applying
our structural algorithm on top of the determinations of the
initial classifier corrects a number of misclassifications,
namely false positives.

This paper is organized as follows: Section 2 presents
the methodology used to handle email data. Our structural
algorithm is described in Section 3. We present the
characteristics of our example workload in Section 4, as
well as the classification results obtained with our
algorithm over this set. Related work is presented in Section 5
and conclusions and future work in Section 6.

2 Modeling Similarity Among 
Email Senders and Recipients 

Our proposed spam detection algorithm exploits the struc- 
tural similarities that exist in groups of senders and recip- 
ients as well as in the relationship established through the 
emails exchanged between them. This section introduces 
our modeling of individual email users and a metric to ex- 
press the similarity existent among different users. It then 
extends the modeling to account for clusters of users who 
have great similarity. 

Our basic assumption is that, in both legitimate email
and spam traffic, users have a defined list of peers they
often have contact with (i.e., they send/receive an email
to/from). In legitimate email traffic, contact lists are a
consequence of the social relationships on which users'
communications are based. In spam traffic, on the other hand,
the lists used by spammers to distribute their solicitations
are created for business interest and, generally, do not
reflect any form of social interaction. A user's contact list
certainly may change over time. However, we expect it
to be much less variable than other characteristics
commonly used for spam detection, such as sender user name,
presence of certain keywords in the email content and
encoding rules. In other words, we expect contact lists to
be more effective in identifying spam and, thus, we use
them as the basis for developing our algorithm.

We start by representing an email user as a vector in a 
multi-dimensional conceptual space created with all pos- 
sible contacts. We represent email senders and recipients 
separately. We then use vectorial operations to express the 
similarity among multiple senders (recipients), and use 
this metric for clustering them. Note that the term email 
user is used throughout this work to denote any identifi- 
cation of an email sender/recipient (e.g., email address, 
domain name, etc). 

Let $N_r$ be the number of distinct recipients. We
represent sender $s_i$ as an $N_r$-dimensional vector, $\vec{s}_i$,
defined in the conceptual space created by the email
recipients being considered. The $n$-th dimension (representing
recipient $r_n$) of $\vec{s}_i$ is defined as:

$$\vec{s}_i[n] = \begin{cases} 1, & \text{if } s_i \rightarrow r_n \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where $s_i \rightarrow r_n$ indicates that sender $s_i$ has sent at
least one email to recipient $r_n$.

Similarly, we define $\vec{r}_i$ as an $N_s$-dimensional vector
representation for the recipient $r_i$, where $N_s$ is the number
of distinct senders being considered. The $n$-th dimension
of this vector is set to 1 if recipient $r_i$ has received at least
one email from $s_n$.

We next define the similarity between two senders $s_i$
and $s_j$ as the cosine of the angle between their vector
representations ($\vec{s}_i$ and $\vec{s}_j$). The similarity is
computed as follows:

$$sim(s_i, s_j) = \frac{\vec{s}_i \circ \vec{s}_j}{|\vec{s}_i| \, |\vec{s}_j|} = \cos(\vec{s}_i, \vec{s}_j), \qquad (2)$$

where $\vec{s}_i \circ \vec{s}_j$ is the inner product of the vectors and
$|\vec{s}_i|$ is the norm of $\vec{s}_i$. Note that this metric varies from 0,
when senders do not share any recipient in their contact
lists, to 1, when senders have identical contact lists and
thus have the same representation. The similarity between
two recipients is defined similarly.
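Because the vectors of Equation 1 are binary, Equation 2 reduces to counting shared contacts. The following is a minimal sketch of that computation, assuming each user's contact list is held as a Python set; the function name `cosine_similarity` and the example addresses are illustrative and not part of the original work.

```python
import math


def cosine_similarity(contacts_a, contacts_b):
    """Cosine of the angle between two binary contact vectors.

    Each vector is stored as the set of contacts whose dimension is 1,
    so the inner product is the number of shared contacts and each
    norm is the square root of the set size.
    """
    if not contacts_a or not contacts_b:
        # Assumption: a user with no contacts is similar to nobody.
        return 0.0
    shared = len(contacts_a & contacts_b)
    return shared / math.sqrt(len(contacts_a) * len(contacts_b))


# Hypothetical contact lists for three senders.
s1 = {"alice@example.org", "bob@example.org"}
s2 = {"bob@example.org", "carol@example.org"}
s3 = {"dave@example.org"}

print(cosine_similarity(s1, s2))  # one shared recipient out of two each
print(cosine_similarity(s1, s3))  # no shared recipients
```

Storing only the non-zero dimensions keeps the representation small even when the number of possible contacts is large.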

We note that our similarity metric has different
interpretations in legitimate and spam traffic. In legitimate
email traffic, it represents social interaction with the same
group of people, whereas in spam traffic, a great
similarity represents the use of different identifiers by the
same spammer or the sharing of distribution lists by
distinct spammers.
Finally, we can use our vectorial modeling approach
to represent a cluster of users (senders or recipients) who
have great similarity. A sender cluster $sc_i$, represented
by vector $\vec{sc}_i$, is computed as the vectorial sum of its
elements, that is:

$$\vec{sc}_i = \sum_{s \in sc_i} \vec{s}. \qquad (3)$$

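The similarity between sender $s_i$ and an existing cluster
$sc_j$ can then be directly assessed by extending Equation 2
as follows:

$$sim(sc_j, s_i) = \begin{cases} \cos(\vec{sc}_j - \vec{s}_i, \vec{s}_i), & \text{if } s_i \in sc_j \\ \cos(\vec{sc}_j, \vec{s}_i), & \text{otherwise} \end{cases} \qquad (4)$$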



We note that a sender's vectorial representation, and thus
the sender cluster to which it belongs (i.e., with which it
shares the greatest similarity), may change over time as new
emails are considered. Therefore, in order to accurately estimate
the similarity between a sender $s_i$ and a sender cluster $sc_j$
to which $s_i$ currently belongs, we first remove $s_i$ from $sc_j$,
and then take the cosine between the two vectors ($\vec{sc}_j - \vec{s}_i$
and $\vec{s}_i$). This is performed so that the previous
classification of a user does not influence its reclassification.
Recipient clusters and the similarity between a recipient and
a given recipient cluster are defined analogously.
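The cluster vector of Equation 3 and the membership-aware similarity of Equation 4 can be sketched as follows. This is an illustration under stated assumptions: vectors are stored sparsely as mappings from contact to weight, and the names `cosine`, `cluster_vector` and `cluster_similarity` are hypothetical helpers rather than the authors' implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine between two sparse vectors stored as {dimension: weight}."""
    dot = sum(w * v.get(dim, 0) for dim, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def cluster_vector(member_vectors):
    """Equation 3: the cluster vector is the sum of its members' vectors."""
    total = Counter()
    for vec in member_vectors:
        total.update(vec)  # adds the weights dimension by dimension
    return total


def cluster_similarity(cluster_vec, sender_vec, is_member):
    """Equation 4: remove the sender's own contribution if it is a member."""
    if is_member:
        reduced = Counter(cluster_vec)  # copy, so the cluster is not mutated
        reduced.subtract(sender_vec)
        cluster_vec = reduced
    return cosine(cluster_vec, sender_vec)


# Hypothetical senders a and b, clustered together.
a = {"r1": 1, "r2": 1}
b = {"r2": 1, "r3": 1}
cluster = cluster_vector([a, b])

print(cluster_similarity(cluster, a, is_member=True))   # compares a with b only
print(cluster_similarity(cluster, a, is_member=False))  # compares a with a + b
```

The member case yields a lower value here because the sender's own contacts no longer inflate the cluster vector, which is the effect the removal step is meant to achieve.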

3 A New Algorithm for Improving 
Spam Detection 

This section introduces our new email classification
algorithm, which exploits the similarities between email
senders and between email recipients for clustering and
uses historical properties of clusters to improve spam
detection accuracy. Our algorithm is designed to work
together with any existing spam detection and filtering
technique that runs at the ISP level. Our goal is to provide
a significant reduction of false positives (i.e., legitimate
emails wrongly classified as spam), which can be as high
as 15% in current filters [2].

A description of the proposed algorithm is shown in
Algorithm 1. It runs on each arriving email $m$, taking
as input the classification of $m$, mClass, as either spam
or legitimate email, performed by the existing auxiliary
spam detection method. Using the vectorial representation
of email senders, recipients and clusters as well as
the similarity metric defined in Section 2, it then
determines a new classification for $m$, which may or may not
agree with mClass. The idea is that the classification by the
auxiliary method is used to build an incremental historical
knowledge base that becomes more representative over
time. Our algorithm benefits from that and outperforms
the auxiliary one, as shown in Section 4.

for all arriving messages m do
    mClass ← classification of m by auxiliary detection method;
    sc = find cluster for m.sender;
    update spam probability for sc using mClass;
    P_s(m) = spam probability for sc;
    P_r(m) = 0;
    for all recipients r ∈ m.recipients do
        rc = find cluster for r;
        update spam probability for rc using mClass;
        P_r(m) = P_r(m) + spam probability for rc;
    end for
    P_r(m) = P_r(m) / size(m.recipients);
    SR(m) = compute spam rank based on P_s(m) and P_r(m);
    if SR(m) > ω then
        classify m as spam;
    else if SR(m) < 1 - ω then
        classify m as legitimate;
    else
        classify m as mClass;
    end if
end for

Algorithm 1: New Algorithm for Email Classification

In order to improve the accuracy of email classification,
our algorithm maintains sets of sender and recipient
clusters, created based on the structural similarity of
different users. A sender (recipient) of an incoming email is
added to the sender (recipient) cluster that is most similar
to it, as defined in Equation (4), provided that their
similarity exceeds a given threshold $\tau$. Thus, $\tau$ defines the
minimum similarity a sender (recipient) must have with a
cluster to be assigned to it. Varying $\tau$ allows us to create
more tightly or loosely knit clusters. If no cluster can be
found, a new single-user cluster is created. In this case,
the sender (recipient) is used as the seed for populating the
new cluster.

The sets of recipient and sender clusters are updated 
at each new email arrival based on the email sender and 
list of recipients. Recall that to determine the cluster a 
previously observed, and thus clustered, user (sender or 
recipient) belongs to, we first remove the user from his 
current cluster and then assess its similarity to each ex- 
isting cluster. Thus, single-user clusters tend to disappear 
as more emails are processed, except for users that appear 
only very sporadically. 
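The assignment step described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class `ClusterSet`, its attributes, and the decision to delete a cluster once it becomes empty are assumptions, and the linear scan over all clusters stands in for whatever indexing a production system would use.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine between two sparse vectors stored as {dimension: weight}."""
    dot = sum(w * v.get(dim, 0) for dim, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


class ClusterSet:
    """Hypothetical container for sender (or recipient) clusters."""

    def __init__(self, tau):
        self.tau = tau          # minimum similarity needed to join a cluster
        self.vectors = {}       # cluster id -> Counter (sum of member vectors)
        self.members = {}       # cluster id -> set of user ids
        self.user_vec = {}      # user id -> contact vector used when clustered
        self.user_cluster = {}  # user id -> cluster id
        self._next_id = 0

    def assign(self, user, contact_vec):
        """(Re)assign a user to the most similar cluster; return its id."""
        # Remove the user from its current cluster first, so the previous
        # classification does not influence the reclassification.
        old = self.user_cluster.pop(user, None)
        if old is not None:
            self.vectors[old].subtract(self.user_vec[user])
            self.members[old].discard(user)
            if not self.members[old]:  # assumption: drop emptied clusters
                del self.vectors[old], self.members[old]

        best_id, best_sim = None, 0.0
        for cid, vec in self.vectors.items():
            sim = cosine(vec, contact_vec)
            if sim > best_sim:
                best_id, best_sim = cid, sim

        if best_id is None or best_sim <= self.tau:
            # No cluster is similar enough: seed a new single-user cluster.
            best_id = self._next_id
            self._next_id += 1
            self.vectors[best_id] = Counter()
            self.members[best_id] = set()

        self.vectors[best_id].update(contact_vec)
        self.members[best_id].add(user)
        self.user_vec[user] = dict(contact_vec)
        self.user_cluster[user] = best_id
        return best_id


clusters = ClusterSet(tau=0.5)
c1 = clusters.assign("s1", {"r1": 1, "r2": 1, "r3": 1})
c2 = clusters.assign("s2", {"r1": 1, "r2": 1, "r3": 1})
c3 = clusters.assign("s3", {"r9": 1})
print(c1 == c2, c3 != c1)  # identical lists share a cluster; s3 gets its own
```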




Figure 1: Spam Rank Computation and Email Classification.

A probability of sending (receiving) a spam is assigned 
to each sender (recipient) cluster. We refer to this measure 
as simply the cluster spam probability. We calculate the 
spam probability of a sender (recipient) cluster as the av- 
erage spam probability of its elements, which, in turn, is 
estimated based on the frequency of spams sent/received 
by each of them in the past. Therefore, our algorithm uses 
the result of the email classification performed by the aux- 
iliary algorithm on each arriving email m (mClass in Al- 
gorithm 1) to continuously update cluster spam probabil- 
ities. 
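A minimal sketch of this bookkeeping is given below, assuming each user's spam probability is the fraction of that user's past messages that the auxiliary algorithm labeled as spam. The class `UserHistory` and the function `cluster_spam_probability` are illustrative names, and returning 0.0 for a user or cluster with no history is an assumption.

```python
from dataclasses import dataclass


@dataclass
class UserHistory:
    """Counts of messages seen for one sender (or recipient)."""
    spam: int = 0
    total: int = 0

    def record(self, is_spam):
        """Update the counts with the auxiliary label of one message."""
        self.total += 1
        if is_spam:
            self.spam += 1

    @property
    def spam_probability(self):
        # Frequency of spam among the messages observed so far.
        return self.spam / self.total if self.total else 0.0


def cluster_spam_probability(histories):
    """Average spam probability over the members of a cluster."""
    histories = list(histories)
    if not histories:
        return 0.0
    return sum(h.spam_probability for h in histories) / len(histories)


# Hypothetical cluster with two members.
h1 = UserHistory()
for label in (True, True, True, False):  # 3 of 4 messages labeled spam
    h1.record(label)
h2 = UserHistory()
for label in (False, False):             # 0 of 2 messages labeled spam
    h2.record(label)

print(cluster_spam_probability([h1, h2]))  # average of 0.75 and 0.0
```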

Let us define the probability of an email $m$ being sent
by a spammer, $P_s(m)$, as the spam probability of its
sender's cluster. Similarly, let the probability of an email
$m$ being addressed to users that receive spam, $P_r(m)$, be
the average spam probability of all of its recipients'
clusters (see Algorithm 1). Our algorithm uses $P_s(m)$ and
$P_r(m)$ to compute a number that expresses the chance
of email $m$ being spam. We call this number the spam
rank of email $m$, denoted by $SR(m)$. The idea is that
emails with large values of $P_s(m)$ and $P_r(m)$ should have
large spam ranks and thus should be classified as spam.
Similarly, emails with small values of $P_s(m)$ and $P_r(m)$
should receive low spam ranks and be classified as
legitimate email.

Figure 1 shows a graphical representation of the
computation of an email's spam rank. We first normalize the
probabilities $P_s(m)$ and $P_r(m)$ by a factor of $\sqrt{2}$, so
that the diagonal of the square region defined in the
bi-dimensional space is equal to 1 (see Figure 1-left). Each
email $m$ can be represented as a point in this square. The
spam rank of $m$, $SR(m)$, is then defined as the length of
the segment starting at the origin (0,0) and ending at the
projection of $m$ on the diagonal of the square (see
Figure 1-right). Note that the spam rank varies between 0 and
1.
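Read literally, this geometric construction admits a simple closed form. The derivation below is a sketch of our reading of the description, assuming "normalize by a factor of $\sqrt{2}$" means dividing both probabilities by $\sqrt{2}$; the symbols $\vec{p}$ and $\vec{d}$ are introduced here only for illustration.

```latex
% Normalized point for message m and unit vector along the diagonal:
\vec{p} = \left( \frac{P_s(m)}{\sqrt{2}},\; \frac{P_r(m)}{\sqrt{2}} \right),
\qquad
\vec{d} = \left( \frac{1}{\sqrt{2}},\; \frac{1}{\sqrt{2}} \right).

% The projection of p on the diagonal has length p . d:
SR(m) = \vec{p} \cdot \vec{d}
      = \frac{P_s(m)}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}
      + \frac{P_r(m)}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}}
      = \frac{P_s(m) + P_r(m)}{2}.
```

Under this reading the spam rank is the arithmetic mean of the two probabilities, which is 0 when both are 0 and 1 when both are 1, consistent with the stated range.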

The spam rank $SR(m)$ is then used to classify $m$ as
follows: if it is greater than a given threshold $\omega$, the email is
classified as spam; if it is smaller than $1 - \omega$, it is classified
as legitimate email. Otherwise, we cannot precisely
classify the email, and we rely on the initial classification
provided by the auxiliary detection algorithm. The parameter
$\omega$ can be tuned to determine the precision that we expect
from our classification. Graphically, emails are classified
according to the marked regions shown in Figure 1-left.
The two triangles, with identical size and height $\omega$,
represent the regions where our algorithm is able to classify
emails as either spam (upper right) or legitimate email
(lower left).
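The decision rule can be sketched as below. Two assumptions are made: the spam rank is computed as the arithmetic mean of the two probabilities, which is what projecting the normalized point onto the unit diagonal yields, and the default threshold of 0.85 is simply the value used in the evaluation of Section 4. The function names and label strings are illustrative.

```python
def spam_rank(p_sender, p_recipients):
    """Length of the projection of the normalized point on the diagonal.

    Dividing both probabilities by sqrt(2) and projecting onto the unit
    diagonal gives (p_sender + p_recipients) / 2.
    """
    return (p_sender + p_recipients) / 2.0


def classify(p_sender, p_recipients, aux_label, omega=0.85):
    """Return 'spam', 'legitimate', or fall back to the auxiliary label."""
    rank = spam_rank(p_sender, p_recipients)
    if rank > omega:
        return "spam"
    if rank < 1.0 - omega:
        return "legitimate"
    # Middle region: the structural evidence is inconclusive.
    return aux_label


print(classify(0.90, 0.95, aux_label="legitimate"))  # overrides to spam
print(classify(0.10, 0.10, aux_label="spam"))        # overrides to legitimate
print(classify(0.50, 0.50, aux_label="spam"))        # keeps auxiliary label
```

Raising `omega` narrows both override regions, so fewer auxiliary decisions are changed.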

4 Experimental Results 

In this section we describe our experimental results. We 
first present some important details of our workload, fol- 
lowed by the quantitative results of our approach, com- 
pared to others. 

4.1 Workload 

Our email workload consists of anonymized and sanitized
SMTP logs of incoming emails to a large university in
Brazil, with around 22 thousand students. The server
handles all emails coming from domains outside the
university, sent to students, faculty and staff with email
addresses under the university's domain name.¹

¹ Only the emails addressed to two out of over 100 university
subdomains (i.e., departments, research labs, research groups) do not pass
through the central server.
The central email server runs Exim email software [9],
the Amavis virus scanner [1] and the Trendmicro Vscan
anti-virus tool [18]. A set of pre-acceptance spam filters
(e.g. black lists, DNS reversal) blocks about 50% of the
total traffic received by the server.

The messages not rejected by the pre-acceptance tests
are directed to SpamAssassin [17]. SpamAssassin is a
popular spam filtering software that detects spam
messages based on a changing set of user-defined rules. These
rules assign scores to each email received based on the
presence in the subject or in the email body of one or more
pre-categorized keywords. SpamAssassin also uses other
rules based on message size and encoding. Highly ranked
messages according to these criteria are flagged as spam.

We analyze an eight-day log collected from
01/19/2004 to 01/26/2004. Our logs store the header of
each email (i.e., sender, recipients, size, date,
etc.) that passes the pre-acceptance filters, along with the
results of the tests performed by SpamAssassin and the
virus scanners. We also have the full body of the messages
that were classified as spam by SpamAssassin. Table 1
summarizes our workload.



Measure                    Non-Spam   Spam      Aggregate
# of emails                191,417    173,584   365,001
Size of emails             11.3 GB    1.2 GB    12.5 GB
# of distinct senders      12,338     19,567    27,734
# of distinct recipients   22,762     27,926    38,875

Table 1: Summary of the Workload



By visually inspecting the list of sender user names²
in the spam component of our workload, we found that a
large number of them corresponded to a seemingly random
sequence of characters, suggesting that spammers
tend to change user names as an evasion technique.
Therefore, for the experiments presented below we identified
the sender of a message by its domain, while recipients
were identified by their full address, including both
domain and user name.
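The identification rule above amounts to two small key functions. The sketch below assumes addresses of the form user@domain and normalizes case and surrounding whitespace, which is our assumption and not something the text specifies; the function names and addresses are illustrative.

```python
def sender_key(address):
    """Identify a sender by domain only, ignoring the user name."""
    # Take everything after the last '@'; lowercasing is an assumption.
    return address.strip().rsplit("@", 1)[-1].lower()


def recipient_key(address):
    """Identify a recipient by full address (user name and domain)."""
    return address.strip().lower()


# Two random-looking user names at the same hypothetical domain
# collapse to a single sender identity.
print(sender_key("x7qz@bulk.example.com"))
print(sender_key("k2lm@Bulk.example.com"))
print(recipient_key(" Ana@uni.example "))
```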

4.2 Classification Results 

The results shown in this section were obtained by
simulating the proposed algorithm over the set
of messages in our logs. The simulator uses an inverted
lists [20] approach for storing information about senders,
recipients and clusters, which is effective in terms of both
memory and processing time. Our simulations were executed
on a commodity workstation (Intel Pentium 4, 2.80 GHz,
with 500 MB of memory) and the simulator was able to
classify 20 messages per second. This is far faster than the
average rate at which messages usually arrive and than the
peak rate observed over the workload collection time [11].

² The part before @ in email addresses.

Figure 2: Number of Email User Clusters and Beta CV vs. $\tau$.
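The text does not detail how the inverted lists are organized. One plausible arrangement, shown purely as an assumption, maps each recipient to the set of senders that have contacted it, so that the senders sharing contacts with a new message can be found without scanning every sender. The class `InvertedContactIndex` and its methods are hypothetical.

```python
from collections import defaultdict


class InvertedContactIndex:
    """Hypothetical inverted index: recipient -> senders that reached it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, sender, recipient):
        """Record that `sender` has sent at least one email to `recipient`."""
        self.postings[recipient].add(sender)

    def shared_counts(self, recipients):
        """Count, per known sender, how many of `recipients` it has contacted.

        These counts are the inner products needed by the cosine similarity
        between a new contact list and each candidate sender.
        """
        counts = defaultdict(int)
        for recipient in set(recipients):
            for sender in self.postings.get(recipient, ()):
                counts[sender] += 1
        return dict(counts)


index = InvertedContactIndex()
index.add("a.example", "r1")
index.add("a.example", "r2")
index.add("b.example", "r2")

print(index.shared_counts(["r1", "r2"]))
print(index.shared_counts(["r9"]))  # no known sender has contacted r9
```

Only senders with at least one shared recipient are ever touched, which is where the memory and processing savings would come from in this arrangement.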




Figure 3: Number of Spam Messages by Varying Message
Spam Probabilities for Different Bin Sizes. Panels: (a) bin
size = 0.10; (b) bin size = 0.25.

The number and quality of the clusters generated
through our similarity measure are the direct result of the
value chosen for the threshold $\tau$ (see Section 3). In order
to determine the best parameter value, the simulation was
executed several times for varying $\tau$.

Figure 2 shows how the number of clusters and beta
CV vary with $\tau$. There is one clear point of stabilization
of the curve (i.e., a plateau) at $\tau = 0.5$, and that is the
value we adopt for the remainder of the paper. Although
other stabilization points occur for values of $\tau$ above 0.5,
the lowest of such values seems to be the most appropriate
for our experiments. The reason is that this value
of $\tau$ is the one that generates the smallest stable number of
clusters, i.e., clusters with more elements, and that allows
us to better evaluate the beneficial effects that clustering
senders and recipients may have. Moreover, by analyzing
the beta CV we can see that the quality of
the clustering for all values $\tau > 0.4$ is approximately the
same.

One of the hypotheses of our algorithm is that we can
group spam messages in terms of the probabilities $P_s(m)$
and $P_r(m)$. Figure 3 shows the fraction of spam
messages that exist for different values of $P_s(m)$ and $P_r(m)$,
grouped based on a discretization of the full space
represented in the plot. The full space is subdivided into
smaller squares of the same size called bins. Clearly,
spam and legitimate messages are indeed located in the
regions (top and bottom, respectively) that we
hypothesized in Section 3. There is, however, a region in the
middle where we cannot determine the classification of the
messages based on the computed probabilities. This is
why it becomes necessary to vary $\omega$. One should adjust $\omega$
based on the level of confidence one has in the
auxiliary algorithm.

Figure 3 also shows that treating senders and
recipients differently when detecting spam can be more
effective than the simple choice we use in this paper.
Messages addressed to recipients that have high $P_r(m)$ tend to be
spam more frequently than messages with the same value
of $P_s(m)$. Analogously, messages with low $P_s(m)$ have a
higher probability of being legitimate messages. Ways of
using this information in our algorithm are an ongoing
research effort that we intend to pursue in future extensions.

Our algorithm makes use of an auxiliary spam
detection algorithm, such as SpamAssassin. Therefore, we
need to evaluate how frequently we maintain the same
classification as such an algorithm. Figure 4 shows the
percentage of messages that received the same
classification and the total number of classified messages in our
simulation as $\omega$ varies. The difference between these
curves is the set of messages that were classified
differently from the original classification provided. There is a
clear tradeoff between the total number of messages that
are classifiable and the agreement with the previous
classification provided by the original classifier algorithm.







³ Beta CV is the ratio intra CV/inter CV and assesses the quality of the
clusters generated. The lower the beta CV, the better the quality of the
grouping obtained [15].



Figure 4: Messages Classified in Accordance with the
Auxiliary Algorithm and the Total Number of Messages
Classified by Varying $\omega$.

In another experiment, we simulated a different
algorithm, described in [19], that also makes use of history
information provided by an auxiliary spam detector. This
approach tries to classify messages based on the historical
properties of their senders. We built a simulator for this
algorithm and executed it against our data set. The results
show that it was able to classify 85.11% of the messages
in accordance with the auxiliary algorithm. It is important
to note that our algorithm, on the other hand, can be tuned
through the choice of the threshold $\omega$. The higher the value
of $\omega$, the more closely the classification of our algorithm
agrees with the auxiliary classification.

We believe that the differences between the original
classification and the classification proposed for high $\omega$
values are generally due to misclassifications by the
auxiliary algorithm. In our data set we have access to the
full body of the messages that were originally classified
as spam. Therefore, we can evaluate a fraction of the
total amount of false positives (messages that the auxiliary
algorithm classifies as spam and our algorithm classifies as
legitimate) that were generated by the auxiliary
algorithm. This is important since there is a common
belief that the cost of false positives is higher than the cost
of false negatives [6].

Each of the possible false positives was manually
evaluated by three people so as to determine whether such a
message was indeed spam. Table 2 summarizes the
results for $\omega = 0.85$; 879 messages were manually
analyzed (0.24% of the total number of messages). Our algorithm
outperforms the original classification since it generates
fewer false positives. We emphasize that we cannot
similarly determine the quality of classification for the
messages classified as legitimate by the auxiliary algorithm,
since we do not have access to the full body of those
messages. Due to the cost of manually classifying messages,
we cannot afford to classify all of the messages classified
as spam by the auxiliary algorithm.



Algorithm                 % of Misclassifications
Original Classification   60.33%
Our approach              39.67%

Table 2: Possible False Positives Generated by the
Approaches Studied.



5 Related Work

Previous work has focused on reducing the impact of
spam. The approaches to reduce spam can be categorized
into pre-acceptance and post-acceptance methods, based
on whether they detect and block spam before or after
accepting messages. Examples of pre-acceptance methods
are black lists [14], gray lists [12], server
authentication [7, 3] and accountability [5]. Post-acceptance
methods are mostly based on information available in the body
of the messages and include Bayesian filters [16] and
collaborative filtering [21].

Recent papers have focused on spam combat techniques
based on characteristics of graph models of email
traffic [4, 8]. These techniques model email
traffic as a graph and detect spam and spam attacks,
respectively, in terms of graph properties. In [4] a graph is
created representing the email traffic captured in the
mailbox of individual users. The subsequent analysis is based
on the fact that such a network possesses several
disconnected components. The clustering coefficient of each of
these components is then used to characterize messages
as spam or legitimate. Their results show that 53% of the
messages were precisely classified using the proposed
approach. In [8] the authors detect
machines that behave as spam senders by analyzing a
border flow graph of sender and recipient machines. In [19],
the authors propose a new scheme for handling spam. It
is a post-acceptance mechanism that processes mail
suspected of being spam at reduced priority, when compared
to the priority assigned to messages classified as
legitimate. The proposed mechanism [19] works in conjunction
with some sort of mail filter that provides the past history of
mail received by a server.

None of the existing spam filtering mechanisms is
infallible [19, 6]. Their main problems are false positives
and wrong mail classification. In addition to those
problems, filters must be continuously updated to capture the
multitude of mechanisms constantly introduced by
spammers to avoid filtering actions. The algorithm presented
in this paper aims at improving the effectiveness of spam
filtering mechanisms, by reducing false positives and by
providing information that helps those mechanisms tune
their collection of rules.



6 Conclusions and Future Work 

In this paper we proposed a new spam detection algorithm 
based on the structural similarity between contact lists of 
email users. The idea is that contact lists, integrated over a 
suitable amount of time, are much more stable identifiers 
of email users than id names, domains or message con- 
tents, which can all be made to vary quickly and widely. 
The major drawback of our approach is that our algorithm 
can only group users based on their structural similarity, 
but has no way of determining by itself if such vector clus- 
ters correspond to spam or legitimate email. Because of 
this feature it must work in tandem with an original clas- 
sifier. Given this information we have shown that we can 
successfully group spam and legitimate email users sep- 
arately and that this structural inference can improve the 
quality of other spam detection algorithms. 

Specifically, we have implemented a simulator based
on data collected from the main SMTP server of a
major university in Brazil that uses SpamAssassin. We have
shown that our algorithm can be tuned to produce
classifications similar to those of the original classifier algorithm
and that, for a certain set of parameters, it was capable of
correcting false positives generated by SpamAssassin in
our workload.
There are several improvements and developments that 
were not explored here, but promise to reinforce the 
strength of our approach. We intend to explore these in fu- 
ture work. We observe that structural similarity gives us a 
basis for time correlation of similar addresses, and as such 
to follow the time evolution of spam sender techniques, in 
ways that suitably factor out the enormous variability of 
their apparent identifiers. Finally we note that the proba- 
bilistic basis of our approach lends itself naturally to the 
evolution of users' classifications (say through Bayesian 
inference), both through collaborative filtering using user 
feedback and from information derived from other algo- 
rithmic classifiers. 